14.3 Classify Iris Flowers Using Naïve Bayes
Download the file from the link below to follow along with the text example or video and to practice on your own.
One of the most widely used datasets for learning classification is the Iris flower dataset, in which Irises are classified into categories using four measured features: sepal length, sepal width, petal length, and petal width. The sepal is the outer covering of a flower, which is usually green and especially noticeable when the flower is still in a bud. The petals are the parts of the flower that are colored and are visible after the flower bud matures and opens up.
The training dataset is made up of 150 instances of three categories of Iris flowers: Iris setosa, Iris versicolor, and Iris virginica. In the dataset, the features are given in centimeters. Since these numeric values are continuous, we must first discretize the features into categorical values. In this case we have categorized the measurements into "short," "medium," and "long" values. This data comes from a famous paper by R.A. Fisher.1 The goal of this procedure is to categorize a set of plants into their correct category of Iris flower.
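The discretization step can be sketched in a few lines of Python. The cutoff values below are illustrative assumptions, not the ones used in the spreadsheet:

```python
# A minimal sketch of discretizing a continuous measurement into
# "short"/"medium"/"long" tokens. The cutoffs are assumed values
# for illustration only.
def discretize(value_cm, short_max, medium_max):
    """Map a measurement in centimeters to a categorical token."""
    if value_cm <= short_max:
        return "short"
    elif value_cm <= medium_max:
        return "medium"
    return "long"

# Example: petal lengths with assumed cutoffs of 2.5 cm and 5.0 cm.
tokens = [discretize(x, 2.5, 5.0) for x in [1.4, 4.5, 6.1]]
print(tokens)  # ['short', 'medium', 'long']
```

The same mapping would be applied to each of the four measured features before training.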
Because the steps are the same for each type of Iris, we have not included figures for all three types. You can view all of the data by downloading the resource file for this section.
Calculating the Likelihoods for Each Type of Iris
The technique for predicting the category of Iris is similar to the one we explained earlier. However, the specific steps are slightly different due to the different format of the data. In the Iris example, there are 150 line items, each with the set of four characteristics for each type of Iris. The first step is to tokenize the characteristics and then calculate the likelihood for each characteristic.
First, we divide the data into training data and test data. We randomly select 20 items to serve as test data, which leaves 130 items for training. Figure 14.4 shows the original dataset with the 20 randomly chosen test items. In the figure, the RAND() function assigns a random number to each line. Then the include column is set to 1 or 0 depending on whether that random number exceeds the cutoff in cell A1, which in this case was 0.91 (for example, =IF(B4>$A$1,1,0)). We adjusted the cutoff so that exactly 20 rows were selected as test data; the remaining 130 rows serve as training data. The test data is highlighted in yellow.
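The spreadsheet tunes a cutoff on random numbers until exactly 20 rows are held out; in code we can ask for exactly 20 held-out rows directly. A minimal sketch, using placeholder row indices rather than the actual file:

```python
import random

# Sketch of the train/test split: hold out 20 of 150 rows at random.
random.seed(0)                    # fixed seed so the split is repeatable
rows = list(range(150))           # stand-in indices for the 150 Iris rows
test_rows = set(random.sample(rows, 20))
train_rows = [r for r in rows if r not in test_rows]
print(len(test_rows), len(train_rows))  # 20 130
```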
We will deviate from the standard Naïve Bayes theorem slightly in this example. Because the three categories that will be tested all have the same denominator in the equation, we will skip part of the calculations. The denominator, p(token), is calculated from the likelihoods for the tokens for the entire dataset. And because we only need the relative relationship between the types of Irises, we will eliminate the p(token) denominator from the comparison equations. Thus, the next step is to tokenize the characteristics for each of the three types of Irises. Figure 14.5 shows the tokens for the Setosa type of Iris. To do this we sorted the training dataset by type of Iris, and then selected the values for each characteristic for all of the Setosa Irises. We simply cut and pasted the values for each characteristic into this single column token list. We do the same for the Versicolor and Virginica Irises.
In preparation for calculating the likelihood for each token for the Setosa Iris, we create a PivotTable that counts the tokens and gives a grand total. Figure 14.6 illustrates this PivotTable.
Next we transfer the information from the PivotTable to a new table to calculate the likelihood for each token. Figure 14.7 illustrates this step. As can be seen, there are several tokens that did not appear in the Setosa dataset. We smooth out the data by adding 0.1 to each count value. The likelihoods are calculated by dividing each count by the grand total in cell F17, as shown by the formula. Also, in this example, since we only have four characteristics (types of tokens), we will multiply the likelihoods directly and not use the logarithms.
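The counting-and-smoothing step can be sketched as follows. The token counts below are made up for illustration; the real values come from the PivotTable in Figure 14.6. Here we divide each smoothed count by the total of the smoothed counts so the likelihoods sum to 1; the grand total in cell F17 plays the same role in the spreadsheet:

```python
from collections import Counter

# Smoothed likelihoods: add 0.1 to every count (including tokens that
# never appeared for this Iris type), then divide by the grand total.
def likelihoods(token_counts, vocabulary, smoothing=0.1):
    smoothed = {t: token_counts.get(t, 0) + smoothing for t in vocabulary}
    total = sum(smoothed.values())
    return {t: c / total for t, c in smoothed.items()}

vocab = ["short", "medium", "long"]
counts = Counter({"short": 40, "medium": 5})  # "long" never appeared
lk = likelihoods(counts, vocab)
print(round(lk["long"], 4))  # small but nonzero, thanks to smoothing
```

Without the 0.1 smoothing, any token absent from the training data would give a zero likelihood and wipe out the whole product.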
Predicting the Probabilities of Irises
The final step is to calculate the probability of each type of Iris in the test data that we extracted earlier. The process is to calculate the probability of each test data item based on the four characteristics (tokens) that were calculated previously. Then we compare the three values, and the type of Iris with the highest probability is the predicted Iris type. Again, the calculation that we use is a simplified Naïve Bayes equation, but without the denominator. To get the probabilities for each type of Iris, we again use the VLOOKUP function to find the token and its likelihood. The basic form of the calculation is the following, where the tokens are the SLength, SWidth, PLength, and PWidth values:

p(Iris type) × p(SLength token | Iris type) × p(SWidth token | Iris type) × p(PLength token | Iris type) × p(PWidth token | Iris type)
Figure 14.8 illustrates this last step. Row 1 contains the prior probability for each type of Iris, calculated as the number of occurrences of that Iris type divided by 130, the total number of Irises in the training data. Columns A through E contain the data to be tested. Column E shows the given Iris type; our calculated prediction should match it.
Columns F, G, and H contain the probabilities for each type of Iris. Row 26 of Figure 14.8 shows an example of the equation used to calculate the probability, in this case for the Setosa type of Iris. The formula multiplies the prior probability of the Iris type by the likelihood for the value of each of the four types of tokens. As shown in Figure 14.7, the probability table for each type of Iris is given a name (e.g., lookupSetosa), and the VLOOKUP function is used to find the probability value for the specified token value.
Column I contains a selection formula that picks the type of Iris with the maximum value across columns F, G, and H. Column J tests whether the prediction in column I equals the Iris type given in column E, using a simple IF function [ =IF(E5=I5,0,1) ], which returns 0 for a correct prediction and 1 for an error.
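The whole prediction step can be sketched in Python. All of the numbers below are illustrative assumptions, not the values from Figure 14.8; the tokens combine the characteristic name and its discretized value so each characteristic can be looked up separately:

```python
# Simplified Naïve Bayes score with the shared denominator dropped:
# score(type) = p(type) * product of likelihood(token | type).
priors = {"Setosa": 45/130, "Versicolor": 43/130, "Virginica": 42/130}
likelihood = {
    "Setosa":     {"SLength-short": 0.60, "SWidth-long": 0.40,
                   "PLength-short": 0.70, "PWidth-short": 0.65},
    "Versicolor": {"SLength-short": 0.10, "SWidth-long": 0.05,
                   "PLength-short": 0.02, "PWidth-short": 0.03},
    "Virginica":  {"SLength-short": 0.05, "SWidth-long": 0.04,
                   "PLength-short": 0.01, "PWidth-short": 0.01},
}

def predict(row_tokens):
    scores = {}
    for iris, prior in priors.items():
        score = prior
        for t in row_tokens:
            score *= likelihood[iris][t]   # plays the role of VLOOKUP
        scores[iris] = score
    return max(scores, key=scores.get)     # column I's max selection

row = ["SLength-short", "SWidth-long", "PLength-short", "PWidth-short"]
print(predict(row))  # Setosa
```

Comparing the returned prediction against the given Iris type in column E corresponds to the IF test in column J.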